pytorch - GPU

Lecture 24

Dr. Colin Rundel

CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.


Core libraries:

  • cuBLAS

  • cuSOLVER

  • cuSPARSE

  • cuFFT

  • cuTENSOR

  • cuRAND

  • Thrust

  • cuDNN
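Several of these libraries back PyTorch operations directly; cuDNN status, for example, can be queried from torch (a quick check, assuming a reasonably recent torch build):

```python
import torch

# cuDNN backs most convolution and RNN kernels torch runs on NVIDIA GPUs
print(torch.backends.cudnn.is_available())  # True only on a CUDA + cuDNN build
print(torch.backends.cudnn.version())       # a version number, or None when absent
```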

CUDA Kernels

// Kernel - Adding two matrices MatA and MatB
__global__ void MatAdd(float MatA[N][N], float MatB[N][N], float MatC[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        MatC[i][j] = MatA[i][j] + MatB[i][j];
}
 
int main()
{
    ...
    // Matrix addition kernel launch from host code
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(
        (N + threadsPerBlock.x - 1) / threadsPerBlock.x,
        (N + threadsPerBlock.y - 1) / threadsPerBlock.y
    );
    
    MatAdd<<<numBlocks, threadsPerBlock>>>(MatA, MatB, MatC);
    ...
}
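The index arithmetic in the kernel above generalizes: each thread computes one entry, and the ceiling division in main() ensures the grid covers all N elements even when N is not a multiple of the block size, which is why the i < N && j < N guard is needed for threads past the edge. A pure-Python sketch of the same logic in one dimension:

```python
# Mimic the CUDA indexing above: every (block, thread) pair maps to one index,
# and the ceiling division guarantees enough blocks to cover all N elements.
N = 20
threads_per_block = 16  # one dimension of the 16x16 block

# equivalent of (N + threadsPerBlock.x - 1) / threadsPerBlock.x
num_blocks = (N + threads_per_block - 1) // threads_per_block

covered = set()
for block in range(num_blocks):
    for thread in range(threads_per_block):
        i = block * threads_per_block + thread  # blockIdx.x * blockDim.x + threadIdx.x
        if i < N:                               # bounds check from the kernel
            covered.add(i)

print(num_blocks)    # 2 blocks needed for N = 20
print(len(covered))  # all 20 indices covered
```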

GPU Status

nvidia-smi
Mon Apr 10 17:00:16 2023      
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            Off| 00000000:02:00.0 Off |                    0 |
| N/A   39C    P0               31W / 250W|   5704MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB            Off| 00000000:03:00.0 Off |                    0 |
| N/A   40C    P0               33W / 250W|   4584MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                        
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1706923      C   python                                     5702MiB |
|    1   N/A  N/A   1706923      C   python                                     4582MiB |
+---------------------------------------------------------------------------------------+

Torch GPU Information

torch.cuda.is_available()
True
torch.cuda.device_count()
2
torch.cuda.get_device_name("cuda:0")
'Tesla P100-PCIE-16GB'
torch.cuda.get_device_name("cuda:1")
'Tesla P100-PCIE-16GB'
torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16276MB, multi_processor_count=56)
torch.cuda.get_device_properties(1)
_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16276MB, multi_processor_count=56)
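A common device-agnostic idiom (not shown on these slides) is to fall back to the CPU when no GPU is present:

```python
import torch

# pick the GPU when available, otherwise run the same code on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.ones(3, device=device)
print(x.device.type)  # 'cuda' on the nodes above, 'cpu' elsewhere
```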

GPU Tensors

Usage of the GPU is governed by the location of the tensors - to use the GPU we allocate them on a GPU device.

cpu = torch.device('cpu')
cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')

x = torch.linspace(0,1,5, device=cuda0); x
tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000], device='cuda:0')
y = torch.randn(5,2, device=cuda0); y
tensor([[ 0.3486, -1.5303],
        [ 1.5092,  0.4547],
        [ 0.1101,  0.2915],
        [-1.1067,  1.8715],
        [-0.3181, -0.6750]], device='cuda:0')
z = torch.rand(2,3, device=cpu); z
tensor([[0.1588, 0.1972, 0.1991],
        [0.9357, 0.3709, 0.0200]])
x @ y
tensor([-0.7158,  0.9881], device='cuda:0')
y @ z
Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
y @ z.to(cuda0)
tensor([[-1.3766, -0.4988,  0.0387],
        [ 0.6651,  0.4662,  0.3097],
        [ 0.2902,  0.1298,  0.0278],
        [ 1.5755,  0.4759, -0.1829],
        [-0.6821, -0.3131, -0.0769]], device='cuda:0')
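Data moves between devices with .to() (or .cuda() / .cpu()); note that .numpy() only works on CPU tensors, so GPU results must be copied back first. A small sketch that runs with or without a GPU:

```python
import torch

a = torch.arange(4.0)                              # created on the CPU by default
b = a.to("cuda") if torch.cuda.is_available() else a
c = b.cpu()                                        # back to the CPU (no-op if already there)
print(c.numpy())                                   # [0. 1. 2. 3.]
```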

NN Layers + GPU

NN layers (parameters) also need to be assigned to the GPU to be used with GPU tensors,

nn = torch.nn.Linear(5,5)
X = torch.randn(10,5).cuda()
nn(X)
Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
nn.cuda()(X)
tensor([[-0.3230,  0.0227,  0.3069,  0.2658,  0.8560],
        [-1.4155, -0.8198,  0.1377,  0.1836, -1.0994],
        [-0.3348, -0.3524, -0.9176,  0.6833,  0.2793],
        [-0.3136, -0.0428, -0.0252,  0.5827,  0.6839],
        [-0.3300, -0.2476, -0.1472, -0.0570,  0.6069],
        [-0.3694, -0.1916,  0.0281, -0.6545,  0.5931],
        [-0.5062,  0.2352, -0.0920, -1.2134,  1.1143],
        [ 0.7346,  0.2651,  1.3730, -0.7072,  1.1413],
        [-0.6905, -0.3887,  0.1515, -0.7117, -0.0063],
        [-0.3983, -0.4820, -0.5694, -0.3195,  0.3031]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
nn.to(device="cuda")(X)
tensor([[-0.3230,  0.0227,  0.3069,  0.2658,  0.8560],
        [-1.4155, -0.8198,  0.1377,  0.1836, -1.0994],
        [-0.3348, -0.3524, -0.9176,  0.6833,  0.2793],
        [-0.3136, -0.0428, -0.0252,  0.5827,  0.6839],
        [-0.3300, -0.2476, -0.1472, -0.0570,  0.6069],
        [-0.3694, -0.1916,  0.0281, -0.6545,  0.5931],
        [-0.5062,  0.2352, -0.0920, -1.2134,  1.1143],
        [ 0.7346,  0.2651,  1.3730, -0.7072,  1.1413],
        [-0.6905, -0.3887,  0.1515, -0.7117, -0.0063],
        [-0.3983, -0.4820, -0.5694, -0.3195,  0.3031]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
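Modules themselves do not carry a .device attribute; to check where a layer currently lives, inspect one of its parameters:

```python
import torch

layer = torch.nn.Linear(5, 5)
# parameters are created on the CPU until .cuda() / .to() is called
print(next(layer.parameters()).device)  # cpu
```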

Back to MNIST

Same MNIST data from last time (1x8x8 images),

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=1234
)

X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train)
X_test = torch.from_numpy(X_test).float()
y_test = torch.from_numpy(y_test)

To use the GPU for computation we need to copy these tensors to the GPU,

X_train_cuda = X_train.to(device=cuda0)
y_train_cuda = y_train.to(device=cuda0)
X_test_cuda = X_test.to(device=cuda0)
y_test_cuda = y_test.to(device=cuda0)
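With several tensors to transfer it can be convenient to move them in one go; a small helper (ours, not part of torch) illustrating the pattern:

```python
import torch

def to_device(tensors, device):
    """Copy each tensor in the iterable to the given device."""
    return tuple(t.to(device=device) for t in tensors)

dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
X_tr, y_tr = to_device((torch.randn(4, 2), torch.zeros(4)), dev)
print(X_tr.device == y_tr.device)  # True
```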

Convolutional NN

class mnist_conv_model(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        
        self.model = torch.nn.Sequential(
          torch.nn.Unflatten(1, (1,8,8)),
          torch.nn.Conv2d(
            in_channels=1, out_channels=8,
            kernel_size=3, stride=1, padding=1
          ),
          torch.nn.ReLU(),
          torch.nn.MaxPool2d(kernel_size=2),
          torch.nn.Flatten(),
          torch.nn.Linear(8 * 4 * 4, 10)
        ).to(device=self.device)
        
    def forward(self, X):
        return self.model(X)
    
    def fit(self, X, y, lr=0.001, n=1000):
        opt = torch.optim.SGD(self.parameters(), lr=lr, momentum=0.9)
        losses = []
        for i in range(n):
            opt.zero_grad()
            loss = torch.nn.CrossEntropyLoss()(self(X), y)
            loss.backward()
            opt.step()
            losses.append(loss.item())

        return losses

    def accuracy(self, X, y):
        val, pred = torch.max(self(X), dim=1)
        return (pred == y).sum() / len(y)

CPU vs CUDA

m = mnist_conv_model(device="cpu")
loss = m.fit(X_train, y_train, n=1000)
loss[-1]
0.038394346833229065
m.accuracy(X_test, y_test)
tensor(0.9694)
m_cuda = mnist_conv_model(device="cuda")
loss = m_cuda.fit(X_train_cuda, y_train_cuda, n=1000)
loss[-1]
0.03704439476132393
m_cuda.accuracy(X_test_cuda, y_test_cuda)
tensor(0.9778, device='cuda:0')

Performance

CPU performance:

m = mnist_conv_model(device="cpu")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = m.fit(X_train, y_train, n=1000)
end.record()

torch.cuda.synchronize()
print(start.elapsed_time(end) / 1000) 
2.52319873046875

GPU performance:

m_cuda = mnist_conv_model(device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
loss = m_cuda.fit(X_train_cuda, y_train_cuda, n=1000)
end.record()

torch.cuda.synchronize()
print(start.elapsed_time(end) / 1000) 
2.355713623046875
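The CUDA events above measure time on the GPU's stream; a device-agnostic alternative (a sketch, not from the slides) is to synchronize explicitly and use time.perf_counter():

```python
import time
import torch

x = torch.randn(500, 500)
if torch.cuda.is_available():
    x = x.cuda()
    torch.cuda.synchronize()      # finish any pending GPU work before timing

start = time.perf_counter()
y = x @ x
if torch.cuda.is_available():
    torch.cuda.synchronize()      # kernels launch asynchronously - wait for the result
elapsed = time.perf_counter() - start
print(elapsed > 0)  # True
```

Without the synchronize() calls the GPU timing would only measure the kernel launch, not the computation itself.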

Profiling CPU

m = mnist_conv_model(device="cpu")
with torch.autograd.profiler.profile(with_stack=True, profile_memory=True) as prof_cpu:
    tmp = m(X_train)
print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        47.81%       1.214ms        48.72%       1.237ms       1.237ms       2.81 Mb           0 b             1  
    aten::max_pool2d_with_indices        27.29%     693.000us        27.29%     693.000us     693.000us       2.10 Mb       2.10 Mb             1  
                  aten::clamp_min        13.82%     351.000us        13.82%     351.000us     351.000us       2.81 Mb       2.81 Mb             1  
                      aten::addmm         3.62%      92.000us         5.00%     127.000us     127.000us      56.13 Kb      56.13 Kb             1  
                      aten::copy_         1.22%      31.000us         1.22%      31.000us      31.000us           0 b           0 b             1  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.539ms

Profiling GPU

m_cuda = mnist_conv_model(device="cuda")
with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(X_train_cuda)
print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          aten::cudnn_convolution        84.14%       1.799ms        85.36%       1.825ms       1.825ms             1  
                                      aten::addmm         3.46%      74.000us         4.49%      96.000us      96.000us             1  
                                 cudaLaunchKernel         2.99%      64.000us         2.99%      64.000us       9.143us             7  
                                  aten::clamp_min         1.45%      31.000us         1.82%      39.000us      39.000us             1  
                                aten::convolution         1.17%      25.000us        89.10%       1.905ms       1.905ms             1  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.138ms
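torch.autograd.profiler is the legacy interface; newer torch releases also ship torch.profiler, which handles CPU and CUDA activity through one context manager. A sketch of an equivalent measurement on a small linear layer:

```python
import torch
from torch.profiler import profile, ProfilerActivity

acts = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    acts.append(ProfilerActivity.CUDA)   # record GPU kernels as well

x = torch.randn(100, 64)
lin = torch.nn.Linear(64, 10)
with profile(activities=acts) as prof:
    tmp = lin(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```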

CIFAR10


Loading the data

import torchvision

training_data = torchvision.datasets.CIFAR10(
    root="/data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
test_data = torchvision.datasets.CIFAR10(
    root="/data",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

CIFAR10 data

training_data.classes
['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
training_data.data.shape
(50000, 32, 32, 3)
test_data.data.shape
(10000, 32, 32, 3)
training_data[0]
(tensor([[[0.2314, 0.1686, 0.1961,  ..., 0.6196, 0.5961, 0.5804],
         [0.0627, 0.0000, 0.0706,  ..., 0.4824, 0.4667, 0.4784],
         [0.0980, 0.0627, 0.1922,  ..., 0.4627, 0.4706, 0.4275],
         ...,
         [0.8157, 0.7882, 0.7765,  ..., 0.6275, 0.2196, 0.2078],
         [0.7059, 0.6784, 0.7294,  ..., 0.7216, 0.3804, 0.3255],
         [0.6941, 0.6588, 0.7020,  ..., 0.8471, 0.5922, 0.4824]],

        [[0.2431, 0.1804, 0.1882,  ..., 0.5176, 0.4902, 0.4863],
         [0.0784, 0.0000, 0.0314,  ..., 0.3451, 0.3255, 0.3412],
         [0.0941, 0.0275, 0.1059,  ..., 0.3294, 0.3294, 0.2863],
         ...,
         [0.6667, 0.6000, 0.6314,  ..., 0.5216, 0.1216, 0.1333],
         [0.5451, 0.4824, 0.5647,  ..., 0.5804, 0.2431, 0.2078],
         [0.5647, 0.5059, 0.5569,  ..., 0.7216, 0.4627, 0.3608]],

        [[0.2471, 0.1765, 0.1686,  ..., 0.4235, 0.4000, 0.4039],
         [0.0784, 0.0000, 0.0000,  ..., 0.2157, 0.1961, 0.2235],
         [0.0824, 0.0000, 0.0314,  ..., 0.1961, 0.1961, 0.1647],
         ...,
         [0.3765, 0.1333, 0.1020,  ..., 0.2745, 0.0275, 0.0784],
         [0.3765, 0.1647, 0.1176,  ..., 0.3686, 0.1333, 0.1333],
         [0.4549, 0.3686, 0.3412,  ..., 0.5490, 0.3294, 0.2824]]]), 6)

Example data

Data Loaders

batch_size = 100

training_loader = torch.utils.data.DataLoader(
    training_data, 
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

test_loader = torch.utils.data.DataLoader(
    test_data, 
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)

Loader generator

training_loader
<torch.utils.data.dataloader.DataLoader object at 0x7fad40088250>
X, y = next(iter(training_loader))
X.shape
torch.Size([100, 3, 32, 32])
y.shape
torch.Size([100])
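With batch_size=100 the 50,000 training images yield 500 mini-batches per epoch, which is what len() of a DataLoader reports. A sketch with an in-memory TensorDataset standing in for CIFAR10:

```python
import torch

# 1000 examples in batches of 100 -> 10 mini-batches per epoch
ds = torch.utils.data.TensorDataset(torch.randn(1000, 3), torch.zeros(1000))
loader = torch.utils.data.DataLoader(ds, batch_size=100, shuffle=True)

print(len(loader))        # 10
X, y = next(iter(loader))
print(X.shape, y.shape)   # torch.Size([100, 3]) torch.Size([100])
```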

CIFAR CNN

class cifar_conv_model(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        self.epoch = 0
        self.model = torch.nn.Sequential(
            torch.nn.Conv2d(3, 6, kernel_size=5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2, 2),
            torch.nn.Conv2d(6, 16, kernel_size=5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2, 2),
            torch.nn.Flatten(),
            torch.nn.Linear(16 * 5 * 5, 120),
            torch.nn.ReLU(),
            torch.nn.Linear(120, 84),
            torch.nn.ReLU(),
            torch.nn.Linear(84, 10)
        ).to(device=self.device)
        
    def forward(self, X):
        return self.model(X)
    
    def fit(self, loader, epochs=10, n_report=250, lr=0.001):
        opt = torch.optim.SGD(self.parameters(), lr=lr, momentum=0.9) 
      
        for j in range(epochs):
            running_loss = 0.0
            for i, (X, y) in enumerate(loader):
                X, y = X.to(self.device), y.to(self.device)
                opt.zero_grad()
                loss = torch.nn.CrossEntropyLoss()(self(X), y)
                loss.backward()
                opt.step()
    
                # print statistics
                running_loss += loss.item()
                if i % n_report == (n_report-1):    # print every n_report mini-batches
                    print(f'[Epoch {self.epoch + 1}, Minibatch {i + 1:4d}] loss: {running_loss / n_report:.3f}')
                    running_loss = 0.0
            
            self.epoch += 1

CNN Performance - CPU (1 step)

X, y = next(iter(training_loader))

m_cpu = cifar_conv_model(device="cpu")
tmp = m_cpu(X)

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    tmp = m_cpu(X)
print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        62.50%       2.162ms        63.00%       2.179ms       1.089ms             2  
    aten::max_pool2d_with_indices        23.47%     812.000us        23.47%     812.000us     406.000us             2  
                      aten::addmm         5.15%     178.000us         6.45%     223.000us      74.333us             3  
                  aten::clamp_min         3.67%     127.000us         3.67%     127.000us      31.750us             4  
                      aten::copy_         1.16%      40.000us         1.16%      40.000us      13.333us             3  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.459ms

CNN Performance - GPU (1 step)

m_cuda = cifar_conv_model(device="cuda")
Xc, yc = X.to(device="cuda"), y.to(device="cuda")
tmp = m_cuda(Xc)
    
with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(Xc)
print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          aten::cudnn_convolution        67.70%       1.465ms        69.50%       1.504ms     752.000us             2  
                                       cudaMalloc         8.64%     187.000us         8.64%     187.000us     187.000us             1  
                                      aten::addmm         5.64%     122.000us         7.26%     157.000us      52.333us             3  
                                 cudaLaunchKernel         4.90%     106.000us         4.90%     106.000us       5.579us            19  
                                  aten::clamp_min         2.73%      59.000us        12.52%     271.000us      67.750us             4  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.164ms

CNN Performance - CPU (1 epoch)

m_cpu = cifar_conv_model(device="cpu")

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    m_cpu.fit(loader=training_loader, epochs=1, n_report=501)
print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             aten::convolution_backward        28.54%     960.874ms        28.97%     975.428ms     975.428us          1000  
                               aten::mkldnn_convolution        17.93%     603.607ms        18.27%     615.230ms     615.230us          1000  
                          aten::max_pool2d_with_indices         9.09%     305.950ms         9.10%     306.332ms     306.332us          1000  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...         8.32%     279.988ms         8.36%     281.519ms     561.914us           501  
                                Optimizer.step#SGD.step         6.04%     203.334ms         9.42%     317.231ms     634.462us           500  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.367s

CNN Performance - GPU (1 epoch)

m_cuda = cifar_conv_model(device="cuda")

with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    m_cuda.fit(loader=training_loader, epochs=1, n_report=501)
print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        11.71%     329.922ms        11.77%     331.521ms     661.719us           501  
                                       cudaLaunchKernel        11.51%     324.092ms        11.51%     324.245ms      11.581us         27998  
                                Optimizer.step#SGD.step         9.25%     260.523ms        12.27%     345.647ms     691.294us           500  
                                            aten::addmm         5.30%     149.358ms         7.23%     203.747ms     135.831us          1500  
                             aten::convolution_backward         4.91%     138.170ms         8.63%     242.937ms     242.937us          1000  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.816s

Loaders & Accuracy

def accuracy(model, loader, device):
    total, correct = 0, 0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device=device), y.to(device=device)
            pred = model(X)
            # the class with the highest energy is what we choose as prediction
            val, idx = torch.max(pred, 1)
            total += pred.size(0)
            correct += (idx == y).sum().item()
            
    return correct / total
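When the model contains layers like BatchNorm or Dropout (as VGG16 below does), evaluation should normally happen with the model in eval mode so that running statistics are used. A variant of the function above (ours, not from the slides) that handles this:

```python
import torch

def eval_accuracy(model, loader, device):
    """Accuracy with the model switched to eval mode (BatchNorm/Dropout safe)."""
    model.eval()                      # use running stats, disable dropout
    total, correct = 0, 0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device=device), y.to(device=device)
            idx = model(X).argmax(dim=1)
            total += y.size(0)
            correct += (idx == y).sum().item()
    model.train()                     # restore training mode
    return correct / total
```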

Model fitting

m = cifar_conv_model("cuda")
m.fit(training_loader, epochs=10, n_report=500, lr=0.01)
## [Epoch 1, Minibatch  500] loss: 2.098
## [Epoch 2, Minibatch  500] loss: 1.692
## [Epoch 3, Minibatch  500] loss: 1.482
## [Epoch 4, Minibatch  500] loss: 1.374
## [Epoch 5, Minibatch  500] loss: 1.292
## [Epoch 6, Minibatch  500] loss: 1.226
## [Epoch 7, Minibatch  500] loss: 1.173
## [Epoch 8, Minibatch  500] loss: 1.117
## [Epoch 9, Minibatch  500] loss: 1.071
## [Epoch 10, Minibatch  500] loss: 1.035
accuracy(m, training_loader, "cuda")
## 0.63444
accuracy(m, test_loader, "cuda")
## 0.572

More epochs

If we continue fitting with the existing model,

m.fit(training_loader, epochs=10, n_report=500)
## [Epoch 11, Minibatch  500] loss: 0.885
## [Epoch 12, Minibatch  500] loss: 0.853
## [Epoch 13, Minibatch  500] loss: 0.839
## [Epoch 14, Minibatch  500] loss: 0.828
## [Epoch 15, Minibatch  500] loss: 0.817
## [Epoch 16, Minibatch  500] loss: 0.806
## [Epoch 17, Minibatch  500] loss: 0.798
## [Epoch 18, Minibatch  500] loss: 0.787
## [Epoch 19, Minibatch  500] loss: 0.780
## [Epoch 20, Minibatch  500] loss: 0.773
accuracy(m, training_loader, "cuda")
## 0.73914
accuracy(m, test_loader, "cuda")
## 0.624

More epochs (again)

m.fit(training_loader, epochs=10, n_report=500)
## [Epoch 21, Minibatch  500] loss: 0.764
## [Epoch 22, Minibatch  500] loss: 0.756
## [Epoch 23, Minibatch  500] loss: 0.748
## [Epoch 24, Minibatch  500] loss: 0.739
## [Epoch 25, Minibatch  500] loss: 0.733
## [Epoch 26, Minibatch  500] loss: 0.726
## [Epoch 27, Minibatch  500] loss: 0.718
## [Epoch 28, Minibatch  500] loss: 0.710
## [Epoch 29, Minibatch  500] loss: 0.702
## [Epoch 30, Minibatch  500] loss: 0.698
accuracy(m, training_loader, "cuda")
## 0.76438
accuracy(m, test_loader, "cuda")
## 0.6217

The VGG16 model

class VGG16(torch.nn.Module):
    def make_layers(self):
        cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [torch.nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [torch.nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           torch.nn.BatchNorm2d(x),
                           torch.nn.ReLU(inplace=True)]
                in_channels = x
        layers += [
            torch.nn.AvgPool2d(kernel_size=1, stride=1),
            torch.nn.Flatten(),
            torch.nn.Linear(512,10)
        ]
        
        return torch.nn.Sequential(*layers).to(self.device)
    
    def __init__(self, device):
        super().__init__()
        self.device = torch.device(device)
        self.model = self.make_layers()
    
    def forward(self, X):
        return self.model(X)

Model

VGG16("cpu").model
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (5): ReLU(inplace=True)
  (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (9): ReLU(inplace=True)
  (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (12): ReLU(inplace=True)
  (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (16): ReLU(inplace=True)
  (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (19): ReLU(inplace=True)
  (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (22): ReLU(inplace=True)
  (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (26): ReLU(inplace=True)
  (27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (32): ReLU(inplace=True)
  (33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (36): ReLU(inplace=True)
  (37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (39): ReLU(inplace=True)
  (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (42): ReLU(inplace=True)
  (43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (44): AvgPool2d(kernel_size=1, stride=1, padding=0)
  (45): Flatten(start_dim=1, end_dim=-1)
  (46): Linear(in_features=512, out_features=10, bias=True)
)

VGG16 performance - CPU

X, y = next(iter(training_loader))
m_cpu = VGG16(device="cpu")
tmp = m_cpu(X)

with torch.autograd.profiler.profile(with_stack=True) as prof_cpu:
    tmp = m_cpu(X)
print(prof_cpu.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::mkldnn_convolution        84.59%     175.236ms        84.83%     175.729ms      13.518ms            13  
          aten::native_batch_norm         7.86%      16.290ms         8.01%      16.596ms       1.277ms            13  
    aten::max_pool2d_with_indices         5.27%      10.922ms         5.27%      10.922ms       2.184ms             5  
                 aten::clamp_min_         1.24%       2.562ms         1.24%       2.562ms     197.077us            13  
                      aten::empty         0.33%     681.000us         0.33%     681.000us       5.238us           130  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 207.159ms

VGG16 performance - GPU

m_cuda = VGG16(device="cuda")
Xc, yc = X.to(device="cuda"), y.to(device="cuda")
tmp = m_cuda(Xc)

with torch.autograd.profiler.profile(with_stack=True) as prof_cuda:
    tmp = m_cuda(Xc)
print(prof_cuda.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          aten::cudnn_convolution        39.22%       2.952ms        58.64%       4.414ms     339.538us            13  
                                       cudaMalloc        18.61%       1.401ms        18.61%       1.401ms     200.143us             7  
                           aten::cudnn_batch_norm        13.33%       1.003ms        18.17%       1.368ms     105.231us            13  
                                 cudaLaunchKernel         7.40%     557.000us         7.40%     557.000us       5.018us           111  
                                       aten::add_         3.92%     295.000us         5.85%     440.000us      16.923us            26  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 7.527ms

VGG16 performance - Apple M1 GPU (mps)

m_mps = VGG16(device="mps")
Xm, ym = X.to(device="mps"), y.to(device="mps")

with torch.autograd.profiler.profile(with_stack=True) as prof_mps:
    tmp = m_mps(Xm)
print(prof_mps.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::native_batch_norm        35.71%       3.045ms        35.71%       3.045ms     234.231us            13  
          aten::_mps_convolution        19.67%       1.677ms        19.88%       1.695ms     130.385us            13  
    aten::_batch_norm_impl_index        11.92%       1.016ms        36.02%       3.071ms     236.231us            13  
                     aten::relu_        11.29%     963.000us        11.29%     963.000us      74.077us            13  
                      aten::add_        10.40%     887.000us        10.44%     890.000us      68.462us            13  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 8.526ms
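The legacy `torch.autograd.profiler` used above reports host-side times; the newer `torch.profiler` API can also record device kernel activity. A sketch, using a stand-in linear model rather than VGG16:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4, 2)   # stand-in for the VGG16 model above
X = torch.randn(8, 4)

# Record CPU ops always, and CUDA kernel times when a GPU is present
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(X)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```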

Fitting w/ lr = 0.01

m = VGG16(device="cuda")
fit(m, training_loader, epochs=10, n_report=500, lr=0.01)

## [Epoch 1, Minibatch  500] loss: 1.345
## [Epoch 2, Minibatch  500] loss: 0.790
## [Epoch 3, Minibatch  500] loss: 0.577
## [Epoch 4, Minibatch  500] loss: 0.445
## [Epoch 5, Minibatch  500] loss: 0.350
## [Epoch 6, Minibatch  500] loss: 0.274
## [Epoch 7, Minibatch  500] loss: 0.215
## [Epoch 8, Minibatch  500] loss: 0.167
## [Epoch 9, Minibatch  500] loss: 0.127
## [Epoch 10, Minibatch  500] loss: 0.103
accuracy(model=m, loader=training_loader, device="cuda")
## 0.97008
accuracy(model=m, loader=test_loader, device="cuda")
## 0.8318

Fitting w/ lr = 0.001

m = VGG16(device="cuda")
fit(m, training_loader, epochs=10, n_report=500, lr=0.001)

## [Epoch 1, Minibatch  500] loss: 1.279
## [Epoch 2, Minibatch  500] loss: 0.827
## [Epoch 3, Minibatch  500] loss: 0.599
## [Epoch 4, Minibatch  500] loss: 0.428
## [Epoch 5, Minibatch  500] loss: 0.303
## [Epoch 6, Minibatch  500] loss: 0.210
## [Epoch 7, Minibatch  500] loss: 0.144
## [Epoch 8, Minibatch  500] loss: 0.108
## [Epoch 9, Minibatch  500] loss: 0.088
## [Epoch 10, Minibatch  500] loss: 0.063
accuracy(model=m, loader=training_loader, device="cuda")
## 0.9815
accuracy(model=m, loader=test_loader, device="cuda")
## 0.7816
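Both runs show a sizeable gap between training and test accuracy. Rather than committing to a single fixed learning rate, a scheduler can start at the larger value and decay toward the smaller one. A sketch with `StepLR` (stand-in linear model, not the `fit()` loop above):

```python
import torch

model = torch.nn.Linear(4, 2)   # stand-in for VGG16
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)

for epoch in range(10):
    # ... one epoch of training would go here ...
    opt.step()
    sched.step()   # lr is 0.01 for epochs 1-5, 0.001 for epochs 6-10

print(opt.param_groups[0]["lr"])
```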

Report

from sklearn.metrics import classification_report
import numpy as np

def report(model, loader, device):
    y_true, y_pred = [], []
    with torch.no_grad():
        for X, y in loader:
            X = X.to(device=device)
            y_true.append( y.cpu().numpy() )
            y_pred.append( model(X).max(1)[1].cpu().numpy() )
    
    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)

    return classification_report(y_true, y_pred, target_names=loader.dataset.classes)

print(report(model=m, loader=test_loader, device="cuda"))

##               precision    recall  f1-score   support
## 
##     airplane       0.82      0.88      0.85      1000
##   automobile       0.95      0.89      0.92      1000
##         bird       0.85      0.70      0.77      1000
##          cat       0.68      0.74      0.71      1000
##         deer       0.84      0.83      0.83      1000
##          dog       0.81      0.73      0.77      1000
##         frog       0.83      0.92      0.87      1000
##        horse       0.87      0.87      0.87      1000
##         ship       0.89      0.92      0.90      1000
##        truck       0.86      0.93      0.89      1000
## 
##     accuracy                           0.84     10000
##    macro avg       0.84      0.84      0.84     10000
## weighted avg       0.84      0.84      0.84     10000
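A natural companion to `classification_report` is a confusion matrix, which shows which classes get confused with which (e.g. cat vs dog above), not just per-class scores. A sketch with stand-in labels rather than the CIFAR-10 predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])   # stand-in true labels
y_pred = np.array([0, 1, 1, 1, 2, 0])   # stand-in predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

With the `report()` function above, the same `y_true` and `y_pred` arrays could be passed to `confusion_matrix` directly.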

Hugging Face

Hugging Face is an online community and platform for sharing machine learning models (architectures and weights), data, and related artifacts. They also maintain a number of packages and related training materials that help with building, training, and deploying ML models.

Some notable resources,

  • transformers - APIs and tools to easily download and train state-of-the-art (pretrained) transformer-based models

  • diffusers - provides pretrained vision and audio diffusion models, and serves as a modular toolbox for inference and training

  • timm - a library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts

Stable Diffusion

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")
prompt = "a picture of thomas bayes with a cat on his lap"
generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompt, generator=generator, num_inference_steps=20, num_images_per_prompt=6)
fit.images
[<PIL.Image.Image image mode=RGB size=512x512 at 0x7FB991D3B910>, <PIL.Image.Image image mode=RGB size=512x512 at
   0x7FB993365850>, <PIL.Image.Image image mode=RGB size=512x512 at 0x7FB9933D0D90>, <PIL.Image.Image image mode=RGB
   size=512x512 at 0x7FB993094110>, <PIL.Image.Image image mode=RGB size=512x512 at 0x7FB991D82410>, <PIL.Image.Image
   image mode=RGB size=512x512 at 0x7FBB53A9C6D0>]
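Since `pipe()` returns plain PIL images, they can be tiled into a contact sheet for side-by-side comparison of the six seeds. The `image_grid` helper below is hypothetical (e.g. `image_grid(fit.images, rows=2, cols=3)`); the demo images are stand-ins so the sketch runs without a GPU:

```python
from PIL import Image

def image_grid(imgs, rows, cols):
    # Paste equally sized images into a rows x cols grid
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

# Stand-in 8x8 images in place of the six 512x512 generations
demo = [Image.new("RGB", (8, 8), (40 * i, 0, 0)) for i in range(6)]
print(image_grid(demo, rows=2, cols=3).size)  # → (24, 16)
```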

Customizing prompts

prompt = "a picture of thomas bayes with a cat on his lap"
prompts = [
  prompt + t for t in 
  ["in the style of a japanese wood block print",
   "as a hipster with facial hair and glasses",
   "as a simpsons character, cartoon, yellow",
   "in the style of a vincent van gogh painting",
   "in the style of a picasso painting",
   "with flowery wall paper"
  ]
]

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompts, generator=generator, num_inference_steps=20, num_images_per_prompt=1)

Increasing inference steps

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(6)]
fit = pipe(prompts, generator=generator, num_inference_steps=50, num_images_per_prompt=1)

Stanford’s Alpaca + Facebook’s Llama

import torch
from transformers import GenerationConfig, LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("chainyo/alpaca-lora-7b")

model = LlamaForCausalLM.from_pretrained(
    "chainyo/alpaca-lora-7b",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

generation_config = GenerationConfig(
    temperature=0.2,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=128,
)
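For intuition, `top_k` and `top_p` restrict sampling to the k most likely tokens and then to the smallest set whose cumulative probability exceeds p (nucleus sampling). An illustrative NumPy sketch of that filtering step, not the actual Hugging Face implementation:

```python
import numpy as np

def filter_top_k_top_p(probs, k=40, p=0.75):
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    keep = order[:k]                           # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[:np.searchsorted(cum, p) + 1]  # smallest set with mass >= p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()                     # renormalize the survivors

probs = np.array([0.5, 0.3, 0.1, 0.1])
print(filter_top_k_top_p(probs, k=2, p=0.75))  # → [0.625 0.375 0.    0.   ]
```

The next token is then sampled from the filtered distribution rather than taken greedily.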

Generate a prompt

instruction = "Write a short childrens story about Thomas Bayes and his pet cat"
input_ctxt = None
prompt = generate_prompt(instruction, input_ctxt)  # prompt template helper from the alpaca-lora project
print(prompt)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a short childrens story about Thomas Bayes and his pet cat

### Response:

Running the model

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )

response = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(response)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a short childrens story about Thomas Bayes and his pet cat

### Response:
Once upon a time, there was a little boy named Thomas Bayes. He had a pet cat named Fluffy, and 
they were the best of friends. One day, Thomas and Fluffy decided to go on an adventure. They 
traveled far and wide, exploring new places and meeting new people. Along the way, Thomas and 
Fluffy learned many valuable lessons, such as the importance of friendship and the joy of discovery.
Eventually, Thomas and Fluffy made their way back home, where they were welcomed with open arms. 
Thomas and Fluffy had a wonderful time.